61 research outputs found

    Fizzy: feature subset selection for metagenomics

    BACKGROUND: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Feature subset selection, a sub-field of machine learning, can also provide unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can identify the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to find the protein family abundances that best discriminate between age groups in the human gut microbiome. RESULTS: We have developed a new Python command line tool for microbial ecologists, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets. CONCLUSIONS: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy. This item is part of the UA Faculty Publications collection. For more information about this item or other items in the UA Campus Repository, contact the University of Arizona Libraries.
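To make "information-theoretic feature selection" concrete, the sketch below ranks features by their empirical mutual information with the phenotype label. It is not Fizzy's implementation: the abstract does not say which selection criteria the tool uses, so this shows only one simple filter (ranking by mutual information). The function names, the toy OTU table, and the assumption that abundances are already discretized are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_features(table, labels, k):
    """Return the k feature names with the highest mutual information.

    `table` maps a feature name to its discretized abundance per sample
    (hypothetical layout, not the BIOM format itself).
    """
    scores = {name: mutual_information(vals, labels) for name, vals in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: four samples, three presence/absence features (assumed values).
toy_table = {
    "OTU_a": [1, 1, 0, 0],  # perfectly tracks the label
    "OTU_b": [1, 0, 1, 0],  # independent of the label
    "OTU_c": [1, 1, 1, 0],  # partially informative
}
toy_labels = ["infant", "infant", "adult", "adult"]

top = rank_features(toy_table, toy_labels, 2)
print(top)
```

Ranking each feature on its own ignores redundancy between features; criteria that also account for redundancy would need the joint statistics of feature pairs.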

    Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms

    BACKGROUND: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa. PRINCIPAL FINDINGS: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than with the other methods tested, as measured by the F-measure and the area under the receiver operating characteristic curve on the test set. By constraining the database to well-represented genera, sensitivity improves 3-15%. Finally, we show that the detector is a good predictor of novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present). CONCLUSIONS: We conclude that selecting a read-length-appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.
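The idea of "training" a threshold on classifier scores can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that a low score means a read is more likely novel, picks the candidate threshold that maximizes the F-measure on training data, and then checks it on held-out data. All function names and score values are hypothetical.

```python
def f_measure(scores, is_novel, threshold):
    """F-measure when reads scoring below `threshold` are called novel."""
    tp = fp = fn = 0
    for score, novel in zip(scores, is_novel):
        predicted_novel = score < threshold
        if predicted_novel and novel:
            tp += 1
        elif predicted_novel and not novel:
            fp += 1
        elif novel:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_threshold(scores, is_novel):
    """Pick the observed score that maximizes the training F-measure."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f_measure(scores, is_novel, t))


# Toy training set: confidence scores in [0, 1] and novelty labels (assumed).
train_scores = [0.95, 0.90, 0.85, 0.40, 0.30, 0.60]
train_novel = [False, False, False, True, True, False]
threshold = train_threshold(train_scores, train_novel)

# Held-out reads evaluated with the trained threshold.
test_scores = [0.20, 0.70]
test_novel = [True, False]
test_f = f_measure(test_scores, test_novel, threshold)
print(threshold, test_f)
```

Restricting candidates to observed scores keeps the search finite; a real evaluation would also report sensitivity and specificity, since the abstract notes the trained threshold tends to be conservative.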

    Additional file 1: Figure S1. of Marker genes that are less conserved in their sequences are useful for predicting genome-wide similarity levels between closely related prokaryotic strains

    The number of genomes in which each marker gene is identified. Out of the 79 potential marker genes, 73 are present in at least 90 % of the genomes. Figure S2. Spearman’s correlation between each marker gene and the average AAI for all complete genomes. Genes are ordered in the same way as in Fig. 3. Figure S3. Trees generated based on AAI and on percent identities of each marker gene (including 16S rRNA), for the Escherichia/Shigella clade. Figure S4. Trees generated based on AAI and on percent identities of each marker gene (including 16S rRNA), for the Streptococcus clade. Figure S5. Trees generated based on AAI and on percent identities of each marker gene (including 16S rRNA), for the Bacillus clade. Table S1. List of 79 potential marker genes surveyed, out of which 73 were found to be present in at least 90 % of the genomes. Table S2. Alternative names of 79 potential marker genes surveyed. Table S3. Split distances between the UPGMA tree generated using AAI and that generated using the percent identities of each marker gene, shown in correspondence with the average percent identity ranks of the marker genes. Table S4. Designed primers for each of the 10 genes that were least conserved in their sequences in the Escherichia/Shigella lineage. (ZIP 3355 kb)

    The taxa breakdown when testing on all the taxonomic levels.

    While genera have 29% novel representation in the test set (in 15% of the sequences mentioned in Fig. 2 [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0032491#pone-0032491-g002]), 14.3% of the families are novel (in 15% of the sequences), 11% of the orders are novel (in 5% of the sequences), 4% of the classes are novel (in 2.4% of the sequences), and 5% of the phyla are novel (in 0.07% of the sequences).

    The abundance numbers after each step in Fig. 8{16}.

    The novelty predicted by the detector is correlated to a decrease in abundant (over 500 occurrences) taxa, when using the RDP trained on the full (future) database.

    The Sensitivity, Specificity, and F-measure for different read-lengths comparing novel-known detection at genus-level and higher (where each rank is trained separately).


    The ROC curve for 4 different novel/known detection methods using the 500 bp read test dataset at the genus-level.

    The naïve Bayesian methods perform better (higher AUC) than Phymm(BL). The threshold chosen from the training data (by F-measure) is shown with a blue dot.

    Setup for the “half-fold” experiments where half the sequences were used for training and half for testing.
